In 2018, the Capital Metropolitan Transportation Authority (CapMetro), the public transportation agency serving Austin, Travis County, and parts of Williamson County, launched "Cap Remap," a redesign of the city's bus network under its transit development plan, Connections 2025. Cap Remap adjusted the transit network based on internal analysis and community outreach, aiming to provide a more frequent, more reliable, and better-connected bus system. Specifically, it remapped certain routes, tripled the number of bus routes that operate every 15 minutes, and matched weekend frequency to demand.
Given the renewed interest in bus transit in US cities, we see an opportunity to innovate in bus planning using modern data science methods. The goal of this article is to present a tool that lets planners test how changes in the built environment and/or bus route information affect bus ridership in Austin.
We will start with an evaluation of Cap Remap to understand the trends, patterns, and characteristics of ridership in Austin. Then, we will use the Automated Passenger Counter data and other open data to develop a machine learning model, followed by accuracy results and validation across different regions.
The data we used come from the Automated Passenger Counter (APC), which counts the number of passengers boarding and alighting on any given bus. The image below illustrates the APC system at work (source).
The data structure is as follows.
The data used here are disaggregated data from May 21–25 and June 11–15, 2018, one week before and one week after Cap Remap. The disaggregated data consist of passenger and bus information for each route, each shift, and each stop: the number of passengers boarding and alighting, along with other ridership information, is recorded at each stop for each shift of each bus route.
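To make this structure concrete, here is a minimal sketch in R with invented numbers. The column names ROUTE, STOP_ID, and PSGR_ON follow the CapMetro extract used later in this analysis; the TRIP and PSGR_OFF columns and all values are hypothetical. Each disaggregated record is one stop event, and stop-level ridership is the sum of boardings across all trips serving that stop:

```r
# Toy disaggregated APC records: one row per route / trip / stop event.
# ROUTE, STOP_ID, PSGR_ON mirror the CapMetro extract; TRIP, PSGR_OFF,
# and every value here are invented for illustration.
disagg_toy <- data.frame(
  ROUTE    = c(801, 801, 801, 20, 20),
  TRIP     = c(1, 1, 2, 1, 1),
  STOP_ID  = c(476, 2606, 476, 476, 553),
  PSGR_ON  = c(5, 3, 7, 2, 4),
  PSGR_OFF = c(0, 4, 1, 0, 3)
)

# Aggregate to stop level: total boardings and alightings across all trips.
stop_totals <- aggregate(cbind(PSGR_ON, PSGR_OFF) ~ STOP_ID,
                         data = disagg_toy, FUN = sum)
stop_totals
```

Aggregating the same records by ROUTE instead of STOP_ID yields the route-level view used in the clustering analysis below.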
How did ridership change before and after Cap Remap (June 3, 2018)?
Currently available data from Capital Metro allow us to observe the trend in ridership before and after Cap Remap. The first important part of the exploratory analysis is to examine the city-wide change in ridership brought by Cap Remap. With stop-level data from January 2017 to November 2019, the aggregated city-wide ridership change is shown in the chart below.
The x-axis represents the month, and the y-axis represents the average daily ridership in that month. The solid lines in the chart are 2018 ridership: the yellow solid line covers January to May 2018 (before Cap Remap, implemented in June 2018), while the blue solid line covers June to December 2018 (after Cap Remap). The dashed line shows ridership from January to June 2019, after Cap Remap had been in place for a year.
From the 2018 trend, it is clear that ridership fluctuated between months. Cap Remap did not bring a rapid increase in ridership after implementation; on the contrary, ridership decreased in June and July. In August, ridership recovered to its pre-Cap Remap level, and in September it jumped to nearly 100,000. It then gradually declined over the winter, but in 2019 ridership was generally higher than in the same month of 2018. This suggests an overall positive impact of Cap Remap on ridership. As for the decrease in June and July and the subsequent increase, riders may have needed time to adjust to the new bus schedules; once they experienced the convenience of the redesign, ridership rose rapidly. Another explanation relates to the university calendar, as the low ridership coincided with summer break and the high ridership with the beginning of the new semester.
Knowing the trend of city-wide ridership change, the next question is how ridership changed across the city: which areas experienced ridership increases and which experienced decreases. Austin's neighborhoods are used here to show the spatial trend.
In the map, darker blue represents a larger ridership increase, while darker red represents a smaller increase or even a decrease. Downtown areas mostly experienced ridership increases from June to September, while the outskirts of Austin saw small increases or outright decreases.
We then created charts showing the ridership change in each neighborhood between June and September 2018. Twelve neighborhoods experienced a ridership decrease over this period, while several experienced increases of more than 10,000. Generally, most neighborhoods saw ridership grow after Cap Remap. Among the 78 neighborhoods in Austin, we identified three that represent different patterns: a neighborhood with an expected ridership increase, one with an unexpected ridership increase, and one with an unexpected ridership decrease.
UT is the neighborhood with an expected ridership increase. It sits just north of the downtown neighborhood, and with many university students living in the area, its bus ridership is sensitive to the academic calendar: there is a relatively clear seasonal pattern in ridership tied to the school year.
The second neighborhood, Govalle, experienced an unexpected ridership increase. After Cap Remap, ridership in Govalle grew by roughly 50% to 75%. Since Govalle is closer to the outskirts of Austin, this increase may reflect Cap Remap's success in strengthening east–west connections.
But some neighborhoods along the east–west direction also experienced ridership decreases. Zilker, located southwest of Austin's downtown, saw a slight, gradual decline in ridership after Cap Remap.
What could potentially influence ridership in terms of route information?
There are several route types, each serving a different purpose. Our hypothesis is that route type plays an important role in determining ridership.
……………………. need more description on route types talk about the service area as base map ……………………..
What makes a good bus system? What’s so special about the ‘hotlines’?
The following analysis aims to find out which routes are popular, why they are popular, and how they have changed, from a micro perspective. Kmeans cluster analysis was used to separate the disaggregated data into groups. Kmeans is an unsupervised learning algorithm that automatically groups observations based on the distribution of each feature. We use this algorithm to see whether the resulting grouping identifies the hotlines, i.e., the routes with the highest ridership.
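As a minimal, self-contained illustration of this idea, consider six hypothetical routes described by synthetic features (the names echo the mean_load and rlength features used later, but the values are invented, not the actual APC data). Kmeans partitions the routes, and the cluster with the higher average load flags the candidate hotlines:

```r
set.seed(801)
# Synthetic route-level features for six hypothetical routes:
# two busy routes (high mean passenger load) and four quiet ones.
routes <- data.frame(
  mean_load = c(28, 25, 6, 5, 7, 4),
  rlength   = c(15, 18, 22, 30, 25, 28)
)

# Scale the features so each contributes comparably, then run Kmeans.
fit <- kmeans(scale(routes), centers = 2, nstart = 25)

# The cluster with the higher average load identifies the "hotlines".
hot_cluster <- which.max(tapply(routes$mean_load, fit$cluster, mean))
hotlines <- which(fit$cluster == hot_cluster)
hotlines
```

With the real data, the same logic is applied to route-level features computed from the APC records, once before and once after Cap Remap.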
We ran the Kmeans analysis both before and after Cap Remap, obtaining 4 routes labeled as hotlines before the remap and 6 after. The hotlines before and after the remap are plotted below. Most of the hot routes run north–south. Two new hotlines, Line 10 and Line 20, emerged after Cap Remap; they are colored in red.
To dive deeper into the characteristics of the hot bus lines, we map out the passenger load for each route at each stop in each direction. We also plotted the passenger load versus stop sequence ID, as well as the average boardings and alightings at each stop along each route. The purpose of this analysis is, first, to find out what is special about the hotlines and, second, to see trends before and after Cap Remap. Note that each route in the Austin bus system runs multiple patterns; to keep the plots interpretable, we selected only the most used pattern for each plot. Below we chose two lines, Line 20 (High Frequency) and Line 801 (Metro Rapid), to demonstrate the detailed route analysis.
By mapping and plotting the average number of passengers on board, as well as the average boardings and alightings at each stop, we can better see how specific locations or neighborhoods contribute to ridership. These regions are feature-engineered in the following analysis. We also noticed that the passenger load tends to peak in the middle portion of the trip, meaning many passengers board at early stops and ride through to stops near the end.
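This load profile follows directly from the APC counts: the number of passengers on board after stop k is the running sum of boardings minus alightings up to that stop. A minimal sketch with invented per-stop counts for one trip:

```r
# Invented boarding/alighting counts for one trip, in stop-sequence order.
on  <- c(8, 6, 5, 2, 1, 0)
off <- c(0, 1, 2, 5, 6, 8)

# Passenger load after each stop = cumulative boardings - cumulative alightings.
load <- cumsum(on) - cumsum(off)
load
```

The load peaks mid-route and returns to zero at the terminus, the same shape we observe in the load-versus-stop-sequence plots for the hotlines.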
In conclusion, hotlines have the following characteristics:
……………need more text in this section……………..
…………5 types of features………. ……….will use what model………..
…………..some more feature exploring……………
………………feature correlation matrix………….
We group the disaggregated data by route and calculate the maximum and mean number of passengers on board at each stop, the average miles traveled and average hours spent per passenger at each stop, and the total run length and total run time of the route. These became the features we used for the Kmeans analysis.
# package arranged by alphabetical orders
library(areal)
library(caret)
library(class)
library(ckanr)
library(corrplot)
library(data.table)
library(dplyr)
library(FNN)
library(flexclust)
library(geojsonsf)
library(gganimate)
library(ggmap)
library(ggplot2)
library(ggrepel)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(leaflet)
library(lubridate)
library(magrittr)
library(MASS)
library(metR)
library(NbClust)
library(nnet)
library(nngeo)
library(openxlsx)
library(osmdata)
library(plotly)
library(QuantPsyc)
library(RColorBrewer)
library(reshape2)
library(RSocrata)
library(sf)
library(spatstat)
library(spdep)
library(stargazer)
library(tidycensus)
library(tidyr)
library(tidyverse)
library(tigris)
library(viridis)

mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
# Quintile breaks for map legends; rnd controls whether values are rounded
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=1.5),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
nn_function <- function(measureFrom,measureTo,k) {
nn <-
FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint)
return(output)
}

setwd('C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801')
agg <- read.csv('D:/Spring20/Practicum/data/MUSA Data - Stop Ridership Aggregated.csv')
disagg <- read.csv('D:/Spring20/Practicum/data/MUSA Disagregated Data Sample 01-06-2020 to 01-10-2020.csv')
austin <- st_read('https://data.austintexas.gov/api/geospatial/3pzb-6mbr?method=export&format=GeoJSON')
Routes1801 <- st_read("Data/Shapefiles_-_JANUARY_2018/Routes.shp")
serviceArea <- st_read('Data/Shapefiles_-_JUNE_2018/Service_Area.shp')
NewRoutes <- st_read('Data/Jia/NewRoutes.shp')
HighFreq <- st_read('Data/Jia/HighFrequency.shp')
Replaced <- st_read('Data/Jia/EliminatedReplacement.shp')
Eliminated <- st_read('Data/Jia/Eliminated.shp')
Routes2001 <- st_read('Data/Jia/Routes2001.shp')
stops <- st_read("C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801/Data/Shapefiles_-_JUNE_2018/Stops.shp")%>%
st_transform(2278)

Routes1801 <- Routes1801%>%
mutate(capremap = "Before Cap Remap")
Routes2001 <- Routes2001%>%
mutate(capremap = "After Cap Remap")

#####Data Structure#####
#We use aggregated data to look at the average ridership on weekdays at individual stops
ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
geom_sf(data = subset(agg_after_sf, STOP_ID == 476), aes(color = "Stop 476"), size = 2, show.legend = "point")+
scale_colour_manual(values = c("Stop 476" = "darkorange"),
guide = guide_legend("Aggregated Data Example"))+
labs(title = "Aggregated Data Structure",
subtitle = "Data from Capital Metro")+
ggrepel::geom_label_repel(
data = subset(agg_after_sf, STOP_ID == 476),aes(label = "Average Ridership = 33 \n Average Passing Buses = 55", geometry = geometry),
stat = "sf_coordinates",
min.segment.length = 3)
#We use disaggregated data to investigate the average ridership on weekdays on different routes.
disagg_803 <- subset(disagg_sf, ROUTE == 803)%>%
group_by(STOP_ID)%>%
summarize(avg_on = mean(PSGR_ON),
avg_load = mean(PSGR_LOAD))
ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
geom_sf(data = disagg_803, aes(color = "Stops on Route 803"), size = 2, show.legend = "point")+
scale_colour_manual(values = c("Stops on Route 803" = "darkorange"),
guide = guide_legend("Disaggregated Data Example"))+
labs(title = "Disaggregated Data Structure",
subtitle = "Data from Capital Metro")+
geom_label_repel(
data = subset(disagg_803, STOP_ID == 2606),aes(label = "Average On-board Passengers of Stop 2606 = 11 \n Route Type = Metro Rapid", geometry = geometry),
stat = "sf_coordinates",
min.segment.length = 0,
segment.color = "lightgrey",
point.padding = 20)

ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = Eliminated, aes(color = "Eliminated Routes"), lwd = 0.5, show.legend = "line")+
geom_sf(data = Replaced, aes(color = "Eliminated but Replaced Routes"),lwd = 0.5, show.legend = "line")+
geom_sf(data = HighFreq, aes(color = "High Frequency Routes"), lwd = 0.8, show.legend = "line")+
geom_sf(data = NewRoutes, aes(color = "New Routes"),lwd = 0.8, show.legend = "line")+
scale_colour_manual(values = c("Eliminated Routes" = "darkorange", "Eliminated but Replaced Routes" = "gold", "High Frequency Routes" = "dodgerblue", "New Routes" = "deeppink"),
guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid", "solid", "solid"), shape = c(NA, NA, NA, NA))))+
labs(title = "Cap Remap Route Changes",
subtitle = "City of Austin, June 2018")

local <- ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "Local"), color = "lightblue2",lwd = 0.8,show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "Local"), color = "lightblue2",lwd = 0.8,show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "lightblue2", "After Cap Remap" = "lightblue2"),
#guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "Local Routes Before and After Cap Remap")
#HighFrequency
highFrequency <- ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "High Frequency"), color = "dodgerblue",lwd = 0.8,show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "High Frequency"), color = "dodgerblue",lwd = 0.8,show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "dodgerblue", "After Cap Remap" = "dodgerblue"),
#guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "High Frequency Routes Before and After Cap Remap")
#major changes grid arrange
grid.arrange(local, highFrequency, ncol = 1)

#Crosstown
crosstown <-ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "greenyellow", "After Cap Remap" = "greenyellow"),
#guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "Crosstown Routes Before and After Cap Remap")
#Feeder
feeder <-ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "lightcoral", "After Cap Remap" = "lightcoral"))+
#guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "Feeder Routes Before and After Cap Remap")
#Flyer
flyer <- ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "magenta2", "After Cap Remap" = "magenta2"),
# guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "Flyer Routes Before and After Cap Remap")
#Express
express <-ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "red", "After Cap Remap" = "red"),
#guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "Express Routes Before and After Cap Remap")
#Special
special <- ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "seashell2", "After Cap Remap" = "seashell2"),
# guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "Special Routes Before and After Cap Remap")
#minor changes grid arrange
grid.arrange(crosstown, feeder, flyer, express, ncol = 2)

#UT Shuttle
UTShuttle <- ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "UT Shuttle"), color = "orange",lwd = 0.8,show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "UT Shuttle"), color = "orange",lwd = 0.8,show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "orange", "After Cap Remap" = "orange"),
#guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "UT Shuttle Before and After Cap Remap")
#Night Owl
nightowl <- ggplot()+
geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
geom_sf(data = subset(Routes1801, ROUTETYPE == "Night Owl"), color = "slategray2",lwd = 0.8,show.legend = FALSE)+
geom_sf(data = subset(Routes2001, ROUTETYPE == "Night Owl"), color = "slategray2",lwd = 0.8,show.legend = FALSE)+
#scale_colour_manual(values = c("Before Cap Remap" = "slategray2", "After Cap Remap" = "slategray2"),
# guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
facet_grid(~capremap)+
labs(title = "Night Owl Routes Before and After Cap Remap")
#no change grid arrange
grid.arrange(UTShuttle, special, nightowl, ncol = 2)

setwd('C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801')
data = read.csv('Data/MUSA Data - Stop Ridership Aggregated.csv')
table(data$YEAR_ID)
#Subset data in 2018 for further analysis
data.y18 <- data %>%
subset(data$YEAR_ID == "2018")

#Change June to 0 and make months before CapRemap become negative, after CapRemap become positive
data.y18$MONTH_ID <- as.numeric(data.y18$MONTH_ID)
data.y18$Month <- data.y18$MONTH_ID - 6
#Make month column become factor
#data.y18$Month <- factor( data.y18$Month, levels = c( "-5", "-4", "-3", "-2", "-1", "0", "1", "2", "3", "4", "5", "6") )
data.y18$Before <- ifelse(data.y18$Month < 0, 1, 0)
data.y19 <- data %>%
subset(data$YEAR_ID == 2019)
data.y19$Month <- data.y19$MONTH_ID - 6
data.y19$Before <- ifelse(data.y19$Month < -5, 1, 0)
data.y <- rbind(data.y18, data.y19)
data.y$YEAR_ID <- as.factor(data.y$YEAR_ID)
table(data$YEAR_ID, data$MONTH_ID)

#Plot Average_on first
plot.city<-
as.data.frame(data.y) %>%
group_by(Month, YEAR_ID) %>%
summarize(BOARD_Count = sum(AVERAGE_ON), Time = as.factor(max(Before))) %>%
ggplot(aes(x=Month,y=BOARD_Count, colour = Time, linetype = YEAR_ID)) +
geom_point() + stat_smooth(size=1) +
plotTheme() +
ylim(70000,100000) +
labs(title="Ridership by stops on an average weekday among all the stops in Austin",
subtitle="CapRemap Redesign Implemented Month in Blue", x="Month", y="Average Daily Ridership")+
geom_vline(xintercept = 0, color = "blue")+
scale_colour_manual(values = c("#E7B800", "#0072B2"), name="Time Period (Before,After)", breaks=c("0", "1"), labels=c("After", "Before"))+
# scale_color_brewer(palette = "YlGnBu")
scale_linetype_manual(values=c("solid", "dotted"))
plot.city

plot.all2<-
as.data.frame(data_final) %>%
group_by(Month, YEAR_ID) %>%
summarize(BOARD_Count = sum(AVERAGE_ON), Time = as.factor(max(Before))) %>%
ggplot(aes(x=Month,y=BOARD_Count, colour = YEAR_ID, linetype = Time)) +
geom_point() + stat_smooth(size=1) +
plotTheme() +
ylim(50000,170000) +
labs(title="System-wide Ridership on an average weekday among all the stops in Austin",
subtitle="CapRemap Redesign Implemented Month in Blue", x="Month", y="Average Daily Ridership")+
geom_vline(xintercept = 0, color = "blue")+
scale_colour_manual(values = c("#E7B800", "#0072B2", "#d95f02"), name="Year", breaks=c("2017", "2018","2019"), labels=c("2017", "2018","2019"))+
# scale_color_brewer(palette = "YlGnBu")
scale_linetype_manual(values=c("solid", "dashed"), name="Time Period (Before,After)", breaks=c("0", "1"), labels=c("After", "Before"))

#Transform ridership data into geo data
data.y18.sf <-
data.y18 %>%
st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
st_transform(2278)
data.y.sf <-
data.y %>%
st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
st_transform(2278)
#Load neighborhood geojson
nhood <- st_read("https://data.austintexas.gov/resource/nz5f-3t2e.geojson")%>%
st_set_crs(4326)%>%
st_transform(2278)

#Load the transformed dataset from long format-stop level to long format-neighborhood level
setwd('C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801')
Rider_nhood4 <- read.csv("Data/GRACE/Rider_nhood4.csv")
Rider_nhood4$Before <- ifelse(Rider_nhood4$Month == -5 |Rider_nhood4$Month == -4 |Rider_nhood4$Month == -3|Rider_nhood4$Month == -2|Rider_nhood4$Month == -1, "1", "0")
Rider_nhood <- read.csv("Data/GRACE/Rider_nhood.csv")
Rider <- merge(Rider_nhood4, nhood, by = "label")

ggplot() +
geom_sf(data = Rider, aes(fill = q5(Dif))) +
scale_fill_brewer(palette = "RdYlBu",
aesthetics = "fill",
labels=qBr(Rider,"Dif"),
name="Quintile\nBreaks") +
labs(title="Ridership Change in Neighborhoods") +
mapTheme()

plot.nhood <-
as.data.frame(Rider_nhood4) %>%
arrange(desc(Dif))%>%
ggplot(aes(x=Month,y=BOARD_Count, colour = Before)) +
geom_point() + stat_smooth(size=1) +
plotTheme() +
facet_wrap(Dif~label,scales="free", ncol=4) +
labs(title="Average Daily Ridership by Neighborhoods in Austin",
subtitle="CapRemap Redesign Implemented Month in Blue", x="Month", y="Average Daily Ridership in the Neighborhood")+
geom_vline(xintercept = 0, color = "blue")+
scale_colour_manual(values = c("#0072B2", "#E7B800"),name="Time Period (Before,After)", breaks=c("0", "1"), labels=c("After", "Before"))
plot.nhood

UT <- Rider_nhood4 %>%
subset(Rider_nhood4$label == "UT")
plot.UT <-
as.data.frame(UT) %>%
arrange(desc(Dif))%>%
ggplot(aes(x=Month,y=BOARD_Count, colour = Before)) +
geom_point() + stat_smooth(size=1) +
plotTheme() +
facet_wrap(Dif~label,scales="free", ncol=4) +
labs(title="Average Daily Ridership in UT Neighborhood",
subtitle="CapRemap Redesign Implemented Month in Blue", x="Month", y="Average Daily Ridership in UT Neighborhood")+
geom_vline(xintercept = 0, color = "blue")+
scale_colour_manual(values = c("#0072B2", "#E7B800"),name="Time Period (Before,After)", breaks=c("0", "1"), labels=c("After", "Before"))

nhood <- st_read("https://data.austintexas.gov/resource/nz5f-3t2e.geojson")%>%
st_set_crs(4326)%>%
st_transform(2278)
UT2 <- nhood%>%
subset(nhood$label == "UT")
Map.UT<-
ggplot() +
geom_sf(data = nhood, fill = "grey30") +
geom_sf(data = UT2, fill = "#56B4E9") +
labs(title = "UT Neighborhoods") +
mapTheme() + theme(legend.position = "bottom")

print(plot.UT)
#print(Map.UT, vp=viewport(.85,.815,.35,.5))
print(Map.UT, vp=viewport(.85,.72,.4,.42))

Govalle <- Rider_nhood4 %>%
subset(Rider_nhood4$label == "Govalle")
plot.Govalle <-
as.data.frame(Govalle) %>%
arrange(desc(Dif))%>%
ggplot(aes(x=Month,y=BOARD_Count, colour = Before)) +
geom_point() + stat_smooth(size=1) +
plotTheme() +
facet_wrap(Dif~label,scales="free", ncol=4) +
labs(title="Average Daily Ridership in Govalle Neighborhood",
subtitle="CapRemap Redesign Implemented Month in Blue", x="Month", y="Average Daily Ridership in Govalle Neighborhood")+
geom_vline(xintercept = 0, color = "blue")+
scale_colour_manual(values = c("#0072B2", "#E7B800"),name="Time Period (Before,After)", breaks=c("0", "1"), labels=c("After", "Before"))

Govalle2 <- nhood%>%
subset(nhood$label == "Govalle")
Map.Govalle<-
ggplot() +
geom_sf(data = nhood, fill = "grey30") +
geom_sf(data = Govalle2, fill = "#56B4E9") +
# labs(title = "Govalle Neighborhoods") +
mapTheme() + theme(legend.position = "bottom")

Zilker <- Rider_nhood4 %>%
subset(Rider_nhood4$label == "Zilker")
plot.Zilker <-
as.data.frame(Zilker) %>%
arrange(desc(Dif))%>%
ggplot(aes(x=Month,y=BOARD_Count, colour = Before)) +
geom_point() + stat_smooth(size=1) +
plotTheme() +
facet_wrap(Dif~label,scales="free", ncol=4) +
labs(title="Average Daily Ridership in Zilker Neighborhood",
subtitle="CapRemap Redesign Implemented Month in Blue", x="Month", y="Average Daily Ridership in Zilker Neighborhood")+
geom_vline(xintercept = 0, color = "blue")+
scale_colour_manual(values = c("#0072B2", "#E7B800"),name="Time Period (Before,After)", breaks=c("0", "1"), labels=c("After", "Before"))

setwd('C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801')
disagg = read.csv('Data/MUSA Disagregated Data Sample 01-06-2020 to 01-10-2020.csv')
nhood = st_read('https://data.austintexas.gov/resource/nz5f-3t2e.geojson')
disagn1 = read.delim('Data/May 2018 Detail 21 to 25.txt', header = FALSE, sep = "\t", dec = ".")
colnames(disagn1) <- colnames(disagg)
disagn2 = read.delim('Data/June 2018 Detail 11 to 15.txt', header = FALSE, sep = "\t", dec = ".")
colnames(disagn2) <- colnames(disagg)

First, let us look at the Kmeans analysis before Cap Remap. We group the disaggregated data by route and calculate the maximum and mean number of passengers on board at each stop, the average miles traveled and average hours spent per passenger at each stop, and the total run length and total run time of the route.
disagn1_km = disagn1 %>%
group_by(ROUTE) %>%
summarise(max_load = max(PSGR_LOAD), mean_load = mean(PSGR_LOAD),
psgmiles = mean(PSGR_MILES), psghours = mean(PSGR_HOURS),
rlength = sum(ACT_MILES_SINCE_LAST_STOP), rtime = sum(ACT_MINS_SINCE_LAST_STOP))
Then, we apply K-means clustering. The number of clusters is determined by both the elbow chart and the 26 criteria provided by the NbClust package. For more information, see the appendix.
fit.km <- kmeans(disagn1_km1, 3, nstart=25)
summary_before = cbind(round(aggregate(disagn1_km1, by=list(cluster=fit.km$cluster), mean),1),fit.km$size)
disagn1_km$cluster = fit.km$cluster
We repeat the same analysis on the disaggregated dataset after the CapRemap.
disagn2_km = disagn2 %>%
group_by(ROUTE) %>%
summarise(max_load = max(PSGR_LOAD), mean_load = mean(PSGR_LOAD),
psgmiles = mean(PSGR_MILES), psghours = mean(PSGR_HOURS),
rlength = sum(ACT_MILES_SINCE_LAST_STOP), rtime = sum(ACT_MINS_SINCE_LAST_STOP))
fit.km <- kmeans(disagn2_km1, 3, nstart=25)
summary_after = cbind(round(aggregate(disagn2_km1, by=list(cluster=fit.km$cluster), mean),1),fit.km$size)
disagn2_km$cluster = fit.km$cluster
The cluster labels are then joined back to the original dataset. For more detail on the clustering results, please see the appendix.
hotlines_pre = unique(disagn1_km[(disagn1_km$cluster == 2),]$ROUTE)
hotlines_po = unique(disagn2_km[(disagn2_km$cluster == 2),]$ROUTE)
print("Hotlines Before:")
print(hotlines_pre)
print("Hotlines After:")
print(hotlines_po)
Finding the number of K-means clusters both before and after the CapRemap:
Both the elbow chart and the 26 indices provided by the NbClust package are used to determine how many clusters should be used in the K-means analysis.
Before CapRemap:
wss <- (nrow(disagn1_km1)-1)*sum(apply(disagn1_km1,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(disagn1_km1,centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
set.seed(1234)
nc <- NbClust(disagn1_km1, min.nc=2, max.nc=15, method="kmeans", index="all")
table(nc$Best.n[1,])
barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters", ylab="Number of Criteria",
main="Number of Clusters \nChosen by 26 Criteria")
After CapRemap:
wss <- (nrow(disagn2_km1)-1)*sum(apply(disagn2_km1,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(disagn2_km1, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
set.seed(1234)
nc <- NbClust(disagn2_km1, min.nc=2, max.nc=15, method="kmeans", index="all")
table(nc$Best.n[1,])
barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters", ylab="Number of Criteria",
main="Number of Clusters \nChosen by 26 Criteria")
In both cases, the optimal number of clusters for the K-means analysis is 3. We therefore run K-means with 3 clusters, as described above in the exploratory analysis section.
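As an additional, hypothetical sanity check (not part of the original workflow), the average silhouette width from the cluster package can corroborate the choice of k; here on toy data with three well-separated groups standing in for the scaled route features:

```r
library(cluster)  # for silhouette()

set.seed(1234)
# toy data: three well-separated groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(40, mean = 5),  ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))

# average silhouette width for k = 2..6; the best k maximizes it
sil <- sapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, dist(toy))[, 3])
})
best_k <- (2:6)[which.max(sil)]  # expected to agree with the elbow/NbClust choice
```

On data this cleanly separated, the silhouette criterion picks k = 3, matching the elbow chart and NbClust result above.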
Below are the K-means results before and after the CapRemap. The numbers are the mean of each feature within each cluster. Cluster 2 has the highest average passenger load and run time both before and after the remapping, and it is also the smallest cluster. We conclude that these are the most popular routes and define them as ‘hotlines’.
rbind(summary_before,summary_after) %>%
kable(caption = "K-Means Clustering Analysis Result: Mean Values for Each Cluster", col.names =c("Cluster", "Max Psg Load", "Mean Psg Load", "Psg Miles", "Psg Hours", "Total Run Length", "Total Run Time", "Cluster Size")) %>%
kable_styling("striped", full_width = F) %>%
pack_rows("Before the CapRemap (May 21 - 25, 2018)", 1, 3) %>%
row_spec(2, bold = T, color = "black", background = "#e7b800")%>%
pack_rows("After the CapRemap (June 11 - 15, 2018)", 4, 6)%>%
row_spec(5, bold = T, color = "black", background = "#e7b800")
capmetro <-
st_read("C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801/Data/Shapefiles_-_JUNE_2018/Routes.shp")%>%
st_transform(2278)
capmetro0 <-st_read("C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801/Data/Shapefiles_-_JANUARY_2018/Routes.shp")%>%
st_transform(2278)
grid.arrange(
ggplot()+
geom_sf(data = nhood, fill = 'black',color = 'grey30')+
geom_sf(data = capmetro0 %>%
filter(ROUTE_ID %in% c(7,300,801,803)),color = '#e7b800')+
labs(title='Hotlines Before:\nLine 7, 300, 801, 803')+ mapTheme(),
ggplot()+
geom_sf(data = nhood, fill = 'black',color = 'grey30')+
geom_sf(data = capmetro %>%
filter(ROUTE_ID %in% c(7,10,20,300,801,803)),color = '#e7b800') +
geom_sf(data = capmetro %>%
filter(ROUTE_ID == 20 | ROUTE_ID == 10), color='red')+
labs(title='Hotlines After:\nLine 7, 10, 20, 300, 801, 803')+mapTheme() ,ncol =2)
routeplot1 <- function(n,p,p1,d) {
# line n before map
t1 = ggplot() +
geom_sf(data = nhood, color = 'grey30',fill = 'grey20') +
geom_sf(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p) %>%
st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
st_transform(2278) %>%
group_by(STOP_ID) %>%
summarise(mean_stop_load = mean(PSGR_LOAD),size = 0.8),
aes(color = mean_stop_load))+
scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25),
breaks = c(0, 5, 10, 15, 20, 25)) +
labs(title=paste("Line",n,"Direction 1, Before CapRemap"),
subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
#line n before passenger load chart
t11 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
group_by(STOP_SEQ_ID) %>%
summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF),
mean_load = mean(PSGR_LOAD))) +
geom_path(aes(x = STOP_SEQ_ID, y = mean_load,
size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25),
breaks = c(0, 5, 10, 15, 20, 25))+
scale_size_continuous()+
ylim(0, 23) +
labs(subtitle=paste("Average Passenger Load"))+plotTheme()+
theme(legend.position="none")
#line n before passenger boarding and alighting
t12 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
group_by(STOP_SEQ_ID) %>%
summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF),
mean_load = mean(PSGR_LOAD))) +
geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha=0.25) +
geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha=0.25) +
geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) +
geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+
ylim(0, 10) +
labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+
theme(legend.position="bottom", legend.box = "horizontal")
# line n after map
t2 = ggplot() +
geom_sf(data = nhood, color = 'grey30',fill = 'grey20') +
geom_sf(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p1) %>%
st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
st_transform(2278) %>%
group_by(STOP_ID) %>%
summarise(mean_stop_load = mean(PSGR_LOAD),size = 0.8),
aes(color = mean_stop_load))+
scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25),
breaks = c(0, 5, 10, 15, 20, 25)) +
labs(title=paste("Line",n,"Direction 1, After CapRemap"),
subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
#line n after passenger load chart
t21 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
group_by(STOP_SEQ_ID) %>%
summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF),
mean_load = mean(PSGR_LOAD))) +
geom_path(aes(x = STOP_SEQ_ID, y = mean_load,
size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25),
breaks = c(0, 5, 10, 15, 20, 25))+
scale_size_continuous()+
ylim(0, 23) +
labs(subtitle=paste("Average Passenger Load"))+plotTheme()+
theme(legend.position="none")
#line n after passenger boarding and alighting
t22 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
group_by(STOP_SEQ_ID) %>%
summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF),
mean_load = mean(PSGR_LOAD))) +
geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha=0.25) +
geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha=0.25) +
geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) +
geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+
ylim(0, 10) +
labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+
theme(legend.position="bottom", legend.box = "horizontal")
grid.arrange(arrangeGrob(t1, t11, t12, ncol = 1, nrow = 3),
arrangeGrob(t2, t21, t22, ncol = 1, nrow = 3),ncol=2)
}
#####OSM#####
getOSM <- function(key,value){
feature <- opq(bbox = 'Austin, Texas')%>%
add_osm_feature(key = key, value = value) %>%
osmdata_sf ()
if(is.null(feature$osm_points)){
feature_poly <- feature$osm_polygons%>%
select(osm_id,geometry)%>%
st_as_sf(coords = geometry, crs = 4326, agr = "constant")%>%
st_transform(2278)
return(feature_poly)
} else {
feature_pt <- feature$osm_points%>%
select(osm_id,geometry)%>%
st_as_sf(coords = geometry, crs = 4326, agr = "constant")%>%
st_transform(2278)
return (feature_pt)
}
}
#shop
shop <- opq(bbox = 'Austin, Texas')%>%
add_osm_feature(key = 'shop') %>%
osmdata_sf ()
shop_pt <- shop$osm_points
shop_pt <- shop_pt%>%
select(osm_id,
name,
addr.city,
geometry)%>%
st_as_sf(coords = geometry, crs = 4326, agr = "constant")%>%
st_transform(2278)
#university
university <- opq(bbox = 'Austin, Texas')%>%
add_osm_feature(key = 'amenity',value = 'university') %>%
osmdata_sf ()
university <- university$osm_polygons%>%
select(geometry)%>%
st_as_sf(coords = geometry, crs = 4326, agr = "constant")%>%
st_transform(2278)
#commercial
commercial <- getOSM('building', 'commercial')
#retail
retail <- getOSM('building', 'retail')
#supermarket
supermkt <- getOSM('building', 'supermarket')
#office
office <- getOSM('building', 'office')
#residential
residential <- getOSM('building','residential')
#bar
bar <- getOSM('amenity', 'bar')
#school
school <- getOSM('amenity', 'school')
#uni
university <- getOSM('amenity', 'university')
#parking
parking <- getOSM('amenity', 'parking')
#stadium
stadium <- getOSM('building', 'stadium')
#trainstation
trainstation <- getOSM('building', 'train_station')
#####spatial join#####
bufferInit <- function(Buffer, Points, Name){
if(inherits(Points$geometry, "sfc_POINT")){ # class() on an sfc column returns two values; inherits() is safe in if()
Init <- st_join(Buffer%>% select(STOP_ID), Points, join = st_contains)%>%
group_by(STOP_ID)%>%
summarize(count = n())%>%
rename(!!Name := count)
}else {
Init <- st_join(Buffer%>% select(STOP_ID), Points, join = st_intersects)%>%
group_by(STOP_ID)%>%
summarize(count = n())%>%
rename(!!Name := count)
}
}
CommercialInit <- bufferInit(StopBuff, commercial, 'commercial_count')
RetailInit <- bufferInit(StopBuff, retail, 'retail_count')
OfficeInit <- bufferInit(StopBuff, office, 'office_count')
ResidentialInit <- bufferInit(StopBuff, residential, 'residential_count')
SupermktInit <- bufferInit(StopBuff, supermkt, 'supermkt_count')
BarInit <- bufferInit(StopBuff, bar, 'bar_count')
UniInit <- bufferInit(StopBuff, university, 'university_count')
ParkingInit <- bufferInit(StopBuff, parking, 'parking_count')
SchoolInit <- bufferInit(StopBuff, school, 'school_count')
StationInit <- bufferInit(StopBuff, trainstation, 'station_count')
StadiumInit <- bufferInit(StopBuff, stadium, 'stadium_count')
#plot OSM
plotOSM <- function(OSM)
ggplot()+
geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL) +
geom_sf(data = stops, aes(color = UniInit$university_count),size = 0.8) +
mapTheme()
#####census#####
options(tigris_use_cache = TRUE)
v17 <- load_variables(2017, "acs5", cache = TRUE)
Hays <- get_acs(state = "48", county = "209", geography = "tract",
variables = "B01001_001", geometry = TRUE)
Travis <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B01001_001", geometry = TRUE)
Williamson <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B01001_001", geometry = TRUE)
Travis_race <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B02001_002", geometry = TRUE)
Williamson_race <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B02001_002", geometry = TRUE)
Travis_noveh <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B08014_002", geometry = TRUE)
Williamson_noveh <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B08014_002", geometry = TRUE)
Travis_oneveh <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B08014_003", geometry = TRUE)
Williamson_oneveh <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B08014_003", geometry = TRUE)
Travis_twoveh <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B08014_004", geometry = TRUE)
Williamson_twoveh <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B08014_004", geometry = TRUE)
Travis_threeveh <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B08014_005", geometry = TRUE)
Williamson_threeveh <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B08014_005", geometry = TRUE)
Travis_fourveh <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B08014_006", geometry = TRUE)
Williamson_fourveh <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B08014_006", geometry = TRUE)
Travis_fiveveh <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B08014_007", geometry = TRUE)
Williamson_fiveveh <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B08014_007", geometry = TRUE)
Travis_poverty <- get_acs(state = "48", county = "453", geography = "tract",
variables = "B06012_002", geometry = TRUE)
Williamson_poverty <- get_acs(state = "48", county = "491", geography = "tract",
variables = "B06012_002", geometry = TRUE)
#####buffer demographics#####
#demo data bind
Population <- rbind(Travis, Williamson)%>%
st_transform(2278)
Population_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Population, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
Population_buff$population <- round(Population_buff$estimate)
Race <- rbind(Travis_race, Williamson_race)%>%
st_transform(2278)
Race_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Race, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
Race_buff$race <- round(Race_buff$estimate)
NoVeh <- rbind(Travis_noveh, Williamson_noveh)%>%
st_transform(2278)
NoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = NoVeh, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
NoVeh_buff$estimate <- round(NoVeh_buff$estimate)
OneVeh <- rbind(Travis_oneveh, Williamson_oneveh)%>%
st_transform(2278)
OneVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = OneVeh, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
OneVeh_buff$estimate <- round(OneVeh_buff$estimate)
TwoVeh <- rbind(Travis_twoveh, Williamson_twoveh)%>%
st_transform(2278)
TwoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = TwoVeh, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
TwoVeh_buff$estimate <- round(TwoVeh_buff$estimate)
ThreeVeh <- rbind(Travis_threeveh, Williamson_threeveh)%>%
st_transform(2278)
ThreeVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = ThreeVeh, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
ThreeVeh_buff$estimate <- round(ThreeVeh_buff$estimate)
FourVeh <- rbind(Travis_fourveh, Williamson_fourveh)%>%
st_transform(2278)
FourVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FourVeh, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
FourVeh_buff$estimate <- round(FourVeh_buff$estimate)
FiveVeh <- rbind(Travis_fiveveh, Williamson_fiveveh)%>%
st_transform(2278)
FiveVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FiveVeh, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
FiveVeh_buff$estimate <- round(FiveVeh_buff$estimate)
Poverty <- rbind(Travis_poverty, Williamson_poverty)%>%
st_transform(2278)
Poverty_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Poverty, sid = GEOID, weight = "sum",
output = "sf", extensive = "estimate")
Poverty_buff$estimate <- round(Poverty_buff$estimate)
#####Time Lag#####
disagg$ACT_STOP_TIME <- as.character(disagg$ACT_STOP_TIME)
disagg <- disagg%>%
mutate(interval60 = floor_date(mdy_hm(ACT_STOP_TIME), unit = "hour"),
interval15 = floor_date(mdy_hm(ACT_STOP_TIME), unit = "15 mins"))
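To make the bucketing above concrete, here is a minimal example (with a hypothetical timestamp) of how mdy_hm() parses the APC's month/day/year stop times and how floor_date() from lubridate rounds them down to the hour and to the 15-minute mark:

```r
library(lubridate)

# a sample APC stop time in the month/day/year hour:minute format used above (hypothetical value)
ts <- mdy_hm("6/11/2018 7:43")

hour_bucket    <- floor_date(ts, unit = "hour")     # rounds down to 07:00
quarter_bucket <- floor_date(ts, unit = "15 mins")  # rounds down to 07:30
```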
study.panel <-
expand.grid(interval60=unique(disagg$interval60),
STOP_ID = unique(disagg$STOP_ID))
disagg.panel <- disagg%>%
right_join(study.panel)%>%
group_by(interval60, STOP_ID)%>%
summarize(avg_on = mean(PSGR_ON))
disagg.timelag <-
disagg.panel %>%
arrange(STOP_ID, interval60) %>%
mutate(lagHour = dplyr::lag(avg_on,1),
lag2Hours = dplyr::lag(avg_on,2),
lag3Hours = dplyr::lag(avg_on,3),
lag4Hours = dplyr::lag(avg_on,4),
lag12Hours = dplyr::lag(avg_on,12),
lag1day = dplyr::lag(avg_on,24))
#Building Area Feature Engineering
#Create the polygon buffer function
bufferPoly <- function(Buffer, Polygons, Name){
if(inherits(Polygons$geometry, "sfc_POLYGON")){ # inherits() avoids the length-2 result of class() on sfc columns
Poly <- st_join(Buffer%>% select(STOP_ID), Polygons, join = st_intersects)%>%
group_by(STOP_ID)%>%
summarize(area = sum(Total_area))%>%
rename(!!Name := area)
}
#else {
# Poly <- st_join(Buffer%>% select(STOP_ID), Polygons, join = st_intersects)%>%
# group_by(STOP_ID)%>%
# summarize(count = n())%>%
# rename(!!Name := count)
#}
}
#Import building footprint shapefile
Buildings <-
st_read("C:/Upenn/Practicum/Data/building_footprints_2017/building_footprints_2017.shp")%>%
st_transform(2278)
#Import Stop shp
stops <-
st_read("C:/Upenn/Practicum/Data/Shapefiles_-_JUNE_2018/Stops.shp") %>%
st_transform(2278)
#Create columns of the num of floors and total building areas
Buildings$Floor <- round(Buildings$MAX_HEIGHT/10)
Buildings$Total_area <- Buildings$Floor * Buildings$SHAPE_AREA
AreaPoly <- bufferPoly(StopBuff, Buildings, 'building_area')
AreaPoly$geometry <- NULL
AreaPoly2 <- bufferPoly(StopBuff2, Buildings, 'building_area')
write.csv(AreaPoly, "C:/Upenn/Practicum/Data/Building_Area2.csv")
write.csv(AreaPoly2, "C:/Upenn/Practicum/Data/Building_Area2.csv") # note: this overwrites the AreaPoly file written above
Austin.sf <- merge(AreaPoly, data.2018, by= "STOP_ID", all.x=TRUE)
Austin.sf$geometry <- NULL
Austin.sf<- Austin.sf[-c(25158,25159,25160,25161,25162,25163,25164,35408,37475,37476,37477,37478,37479,37480,37481,39140,39153,39292,42526,42527,42528,42529,42530,42531,
42532,42533,42534,44292,44293,45249,45250,48882,48883,48915,48916), ]
which(is.na(Austin.sf$LONGITUDE))
Austin.sf <-
Austin.sf %>%
st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
st_transform(2278)
library(wesanderson)
palette5 <- wes_palette("Moonrise3", n = 5)
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
ggplot() +
geom_sf(data = nhood, fill = "grey40") +
geom_sf(data = Austin.sf, aes(colour = q5(building_area)), show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(Austin.sf,"building_area"),
name="Quintile\nBreaks") +
labs(title="Building Area within 1/4 Mile Buffer from each Stop") +
mapTheme()
# use datasets after cap remap
route_stop = disagn2j %>% group_by(ROUTE, TRIP, STOP_ID) %>% slice(1)
# reorganize the direction feature
capmetro_dir_mod = capmetro %>%
select(ROUTE_ID, DIRECTION, ROUTETYPE) %>%
st_drop_geometry() %>%
mutate(DTYPES=case_when(
DIRECTION=="Southbound"~"SouthNorth",
DIRECTION=="Northbound"~"SouthNorth",
DIRECTION=="Westbound"~"WestEast",
DIRECTION=="Eastbound"~"WestEast",
DIRECTION=="Inbound"~"InOut",
DIRECTION=="Outbound"~"InOut",
DIRECTION=="Counterclockwise"~"Counterclockwise",
DIRECTION=="Clockwise"~"Clockwise"
))
dis_cap1 = merge(x = route_stop %>% select(ROUTE, STOP_ID),
y = capmetro_dir_mod %>%
group_by(ROUTETYPE, DTYPES, ROUTE_ID) %>%
slice(1),
by.x = 'ROUTE', by.y = 'ROUTE_ID')
# number of shifts (total frequency); route orientation frequency at each stop
stop_dir_freq = merge(x = dis_cap1 %>% group_by(STOP_ID) %>% summarise(nshifts=n()),
y = dis_cap1 %>% group_by(STOP_ID, DTYPES) %>% summarise(n=n()) %>%
mutate(prop_route=n/sum(n)) %>% subset(select=c("STOP_ID","DTYPES","prop_route")) %>%
spread(DTYPES, prop_route), by = "STOP_ID")
stop_dir_freq[is.na(stop_dir_freq)] <- 0
# route type frequency at each stop
stop_type_freq = dis_cap1 %>% group_by(STOP_ID, ROUTETYPE) %>% summarise(n=n()) %>%
mutate(prop_type=n/sum(n)) %>% subset(select=c("STOP_ID","ROUTETYPE","prop_type")) %>%
spread(ROUTETYPE, prop_type)
stop_type_freq[is.na(stop_type_freq)] <- 0
stop_type_freq[11] <- NULL # remove NA column
# hotline frequency & hub dummy
hotline_dum = as.numeric(dis_cap1$ROUTE %in% hotlines_po) # hotlines_po is the vector of post-remap hotline routes
stop_hot = cbind(dis_cap1, hotline_dum)
stop_hot_freq = stop_hot %>% group_by(STOP_ID, hotline_dum) %>% summarise(n=n()) %>%
mutate(prop_hot=n/sum(n)) %>% subset(select=c("STOP_ID","hotline_dum","prop_hot")) %>%
spread(hotline_dum, prop_hot)
stop_hot_freq[is.na(stop_hot_freq)] <- 0
colnames(stop_hot_freq) <- c("STOP_ID", "hotline_0", "hotline_1")
hub_stop <-
st_read("C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801/Data/Shapefiles_-_JUNE_2018/Hub_Stops.shp") %>% st_transform(2278)
hub_dum <- as.numeric(unique(dis_cap1$STOP_ID) %in% hub_stop$STOP_ID)
hub_dum = data.frame(STOP_ID = stop_type_freq$STOP_ID, hubs = hub_dum)
stop_type_freq = merge(stop_type_freq, hub_dum, by = "STOP_ID")
# building area
build_dens <- read.csv("C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801/Data/Building_Area2.csv")
# ridership calc (for the data import, see Data_Transform_New_agg.R)
data.2019$STOP_ID = as.numeric(data.2019$STOP_ID)
data.2019.mean = data.2019 %>%
group_by(STOP_ID) %>%
summarise(mean_on = mean(as.numeric(AVERAGE_LOAD)))
# add land use data as fixed effect
landuse <-
st_read("https://data.austintexas.gov/api/geospatial/fj9m-h5qy?method=export&format=GeoJSON") %>%
st_transform(2278)
# re-categorize the general land use column
landuse_x = landuse %>%
select(shape_area, general_land_use, land_use, geometry) %>%
mutate(land_use_x=case_when(
general_land_use %in% c(100, 113, 200) ~ "residential",
general_land_use == 300 ~ "commercial",
general_land_use == 330 ~ "mixed_use",
general_land_use == 400 ~ "office",
general_land_use == 500 ~ "industrial",
general_land_use == 600 ~ "civic",
general_land_use %in% c(700, 940) ~ "nature",
general_land_use %in% c(800, 860) ~ "transportation",
general_land_use %in% c(870, 900, 999) ~ "others"
))
#find the composition of the land use surrounding each stop
stop_buff_landuse = all_x2 %>%
group_by(STOP_ID, land_use_x) %>%
summarize(n=sum(as.numeric(shape_area))) %>%
mutate(prop_landuse=n/sum(n)) %>%
select(STOP_ID, prop_landuse, land_use_x) %>%
spread(land_use_x, prop_landuse)
#drop the na column
stop_buff_landuse[12] <- NULL
#fill na with 0
stop_buff_landuse[is.na(stop_buff_landuse)] <- 0
# neighborhood as fixed effect
stop_nhood = stops %>%
select(STOP_ID, geometry) %>%
st_as_sf() %>%
st_transform(2278) %>%
st_join(nhood %>% select(label))
#fill in the na values (outside all the neighborhood boundaries) with "Not Defined"
stop_nhood$label <- as.character(stop_nhood$label)
stop_nhood[is.na(stop_nhood)] <- "Not Defined"
# school district as fixed effect
stop_school = stops %>%
select(STOP_ID, geometry) %>%
st_as_sf() %>%
st_transform(2278) %>%
st_join(TrusteeDist %>% select(TRUSTEE))
#fill in the na values (outside all the school district boundaries) with "Not Defined"
stop_school$TRUSTEE <- as.character(stop_school$TRUSTEE)
stop_school[is.na(stop_school)] <- "Not Defined"
stop_school$TRUSTEE <- as.factor(stop_school$TRUSTEE)
## k nearest neighbor distance as features
## function for k nearest neighbor
nn_function <- function(measureFrom,measureTo,k) {
nn <-
FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint)
return(output)
}
#make the stop points into a coordinate matrix
stopsXY <- stops %>%
select(STOP_ID, geometry) %>%
st_as_sf() %>%
st_transform(2278) %>%
mutate(lat = st_coordinates(.)[,2],
lng = st_coordinates(.)[,1]) %>%
as.data.frame() %>%
select(lat,lng) %>%
as.matrix()
#function that converts a geometry column into a coordinate matrix
makeXY <- function(original_data, geometry_name) {
output = original_data %>%
select(geometry) %>%
st_as_sf() %>%
st_transform(2278) %>%
mutate(lat = st_coordinates(.)[,2],
lng = st_coordinates(.)[,1]) %>%
as.data.frame() %>%
select(lat,lng) %>%
as.matrix()
return(output)
}
# function that creates the knn distance column, ready for joining
amenity_dist <- function(original_data, geometry_name, k, new_name) {
makeXY <- makeXY(original_data, geometry_name)
pointDistance <- nn_function(stopsXY, makeXY, k)
names(pointDistance) <- new_name
return(pointDistance)
}
#add point of interest locations
#UT Austin
UTAustin <- data_frame(name = "UT Austin", lat = 30.285350, lon = -97.734458) %>%
st_as_sf(coords = c("lon","lat"), crs = 4326, agr = "constant") %>%
st_transform(2278)
UTAustin_bound <- st_read("C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801/Data/UTAustin/POLYGON.shp") %>%
st_transform(2278)
#CBD
CBD_bound <- st_read("C:/Users/HanyongXu/Documents/Me/grad/Spring_2020/MUSA801/Data/CBD/POLYGON.shp") %>%
st_transform(2278)
#air port
airport <- data_frame(name = "Airport", lat = 30.202741, lon = -97.666771) %>%
st_as_sf(coords = c("lon","lat"), crs = 4326, agr = "constant") %>%
st_transform(2278)
ggplot() +
#geom_sf(data = nhood) +
#geom_sf(data = stops, aes(color = as.numeric(st_distance(stops, UTAustin_bound, by_element = TRUE))),size=0.5) +
geom_sf(data = UTAustin_bound) +
#geom_sf(data = stops, aes(color = as.numeric(st_distance(stops, CBD_bound))),size=0.5) +
#geom_sf(data = CBD_bound) +
mapTheme()
#find the distance to UT Austin
utaustinDist = st_distance(stops, UTAustin_bound) %>%
as.data.frame() %>%
rowwise() %>%
mutate(utaustin_dist = min(V1, V2, V3, V4, V5, V6, V7)) %>%
select(utaustin_dist)
#find the distance to CBD
CBDDist = st_distance(stops, CBD_bound) %>%
as.numeric() %>%
as.data.frame()
colnames(CBDDist) = "CBD_dist"
# calculate the distance to the 3 nearest amenities
commercialDist = amenity_dist(commercial, geometry, 3, "commercialDist")
retailDist = amenity_dist(retail, geometry, 3, "retailDist")
supermktDist = amenity_dist(supermkt, geometry, 3, "supermktDist")
officeDist = amenity_dist(office, geometry, 3, "officeDist")
residentialDist = amenity_dist(residential, geometry, 3, "residentialDist")
barDist = amenity_dist(bar, geometry, 3, "barDist")
schoolDist = amenity_dist(school, geometry, 3, "schoolDist")
universityDist = amenity_dist(university, geometry, 3, "universityDist")
parkingDist = amenity_dist(parking, geometry, 3, "parkingDist")
stadiumDist = amenity_dist(stadium, geometry, 3, "stadiumDist")
trainstationDist = amenity_dist(trainstation, geometry, 3, "trainstationDist")
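Each of these calls averages the distance from every stop to its k nearest features of a given type. As a hypothetical sanity check on toy coordinates, the underlying FNN::get.knnx computation used by nn_function reduces to row-wise means of the knn distance matrix:

```r
library(FNN)

# one bus stop at the origin and three amenities (toy planar coordinates)
stop_xy    <- matrix(c(0, 0), ncol = 2)
amenity_xy <- rbind(c(1, 0), c(0, 2), c(3, 0))

# mean distance from the stop to its 2 nearest amenities: (1 + 2) / 2 = 1.5
d <- rowMeans(FNN::get.knnx(amenity_xy, stop_xy, k = 2)$nn.dist)
```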
airportDist = amenity_dist(airport, geometry, 1, "airportDist")
## spatial lag
spatial_lag = get.knn(stopsXY, 5)$nn.index %>%
as.data.frame() %>%
rownames_to_column(var = "thisPoint") %>%
mutate(ridership1 = all_x1$mean_on[.$V1], ridership2 = all_x1$mean_on[.$V2],
ridership3 = all_x1$mean_on[.$V3], ridership4 = all_x1$mean_on[.$V4],
ridership5 = all_x1$mean_on[.$V5]) %>%
mutate(spatial_lag2 = rowMeans(.[,7:8], na.rm = TRUE),
spatial_lag3 = rowMeans(.[,7:9], na.rm = TRUE),
spatial_lag5 = rowMeans(.[,7:11], na.rm = TRUE))
#census rename
Population_buff$population <- round(Population_buff$estimate)
Race_buff$race <- round(Race_buff$estimate)
NoVeh_buff$NoVeh <- round(NoVeh_buff$estimate)
OneVeh_buff$OneVeh <- round(OneVeh_buff$estimate)
TwoVeh_buff$TwoVeh <- round(TwoVeh_buff$estimate)
ThreeVeh_buff$ThreeVeh <- round(ThreeVeh_buff$estimate)
FourVeh_buff$FourVeh <- round(FourVeh_buff$estimate)
FiveVeh_buff$FiveVeh <- round(FiveVeh_buff$estimate)
Poverty_buff$Poverty <- round(Poverty_buff$estimate)
all_x1 = CommercialInit %>% #amenities and route related
left_join(st_drop_geometry(RetailInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(OfficeInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(ResidentialInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(SupermktInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(BarInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(UniInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(ParkingInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(SchoolInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(StationInit), by = "STOP_ID") %>%
left_join(st_drop_geometry(StadiumInit), by = "STOP_ID") %>%
left_join(stop_dir_freq, by = "STOP_ID") %>%
left_join(stop_type_freq, by = "STOP_ID") %>%
left_join(stop_hot_freq, by = "STOP_ID") %>%
left_join(build_dens, by = "STOP_ID") %>%
left_join(st_drop_geometry(stop_buff_landuse), by = "STOP_ID") %>%
left_join(st_drop_geometry(Race_buff) %>% select(STOP_ID, race) %>% mutate(STOP_ID = as.numeric(STOP_ID), race = as.numeric(race)), by = "STOP_ID") %>% #census data
left_join(st_drop_geometry(Population_buff) %>% select(STOP_ID, population) %>% mutate(STOP_ID = as.numeric(STOP_ID), population = as.numeric(population)), by = "STOP_ID") %>%
left_join(st_drop_geometry(NoVeh_buff) %>% select(STOP_ID, NoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), NoVeh = as.numeric(NoVeh)), by = "STOP_ID") %>%
left_join(st_drop_geometry(OneVeh_buff) %>% select(STOP_ID, OneVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), OneVeh = as.numeric(OneVeh)), by = "STOP_ID") %>%
left_join(st_drop_geometry(TwoVeh_buff) %>% select(STOP_ID, TwoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), TwoVeh = as.numeric(TwoVeh)), by = "STOP_ID") %>%
left_join(st_drop_geometry(ThreeVeh_buff) %>% select(STOP_ID, ThreeVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), ThreeVeh = as.numeric(ThreeVeh)), by = "STOP_ID") %>%
left_join(st_drop_geometry(FourVeh_buff) %>% select(STOP_ID, FourVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FourVeh = as.numeric(FourVeh)), by = "STOP_ID") %>%
left_join(st_drop_geometry(FiveVeh_buff) %>% select(STOP_ID, FiveVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FiveVeh = as.numeric(FiveVeh)), by = "STOP_ID") %>%
left_join(st_drop_geometry(Poverty_buff) %>% select(STOP_ID, Poverty) %>% mutate(STOP_ID = as.numeric(STOP_ID), Poverty = as.numeric(Poverty)), by = "STOP_ID") %>%
left_join(st_drop_geometry(stop_nhood), by = "STOP_ID") %>% # fixed effects
left_join(st_drop_geometry(stop_school), by = "STOP_ID") %>%
select(-c(hotline_0, X)) %>%
left_join(data.2019.mean, by = "STOP_ID")
#spatial lag, knn dist
all_x3 = bind_cols(list(all_x1, utaustinDist, CBDDist, commercialDist, retailDist, supermktDist, officeDist, residentialDist, schoolDist, universityDist, parkingDist, stadiumDist, trainstationDist, airportDist, spatial_lag %>% select(spatial_lag2, spatial_lag3, spatial_lag5)))
#time lag
all1cor <-
all_x3 %>%
select(-c(STOP_ID, label, TRUSTEE)) %>%
st_set_geometry(NULL)
M <- cor(na.omit(all1cor))
corrplot(M, number.cex = .7, tl.srt = 90, method = "color", type = "upper", tl.col = "black", na.label = ".", na.rm = T, tl.cex = 0.7, col = brewer.pal(n = 10, name = "RdYlBu"))
correlation.long <-
st_set_geometry(all_x3, NULL) %>%
select(-c(STOP_ID, label, TRUSTEE)) %>%
gather(Variable, Value, -mean_on)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(as.numeric(Value), mean_on, use = "complete.obs"))
ggplot(correlation.long, aes(Value, mean_on)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#2C7BB6") +
facet_wrap(~Variable, ncol = 5, scales = "free") +
labs(title = "Mean Ridership as a Function of Potential Factors")
#positive correlation
correlation.long1 <-
st_set_geometry(all_x3 %>% select("High Frequency", hotline_1, nshifts, SouthNorth, spatial_lag5, mean_on), NULL) %>%
gather(Variable, Value, -mean_on)
#negative correlation
correlation.long2 <-
st_set_geometry(all_x3 %>% select(trainstationDist, utaustin_dist, WestEast, airportDist, CBD_dist, mean_on), NULL) %>%
gather(Variable, Value, -mean_on)
correlation.cor1 <-
correlation.long1 %>%
group_by(Variable) %>%
summarize(correlation = cor(as.numeric(Value), mean_on, use = "complete.obs"))
correlation.cor2 <-
correlation.long2 %>%
group_by(Variable) %>%
summarize(correlation = cor(as.numeric(Value), mean_on, use = "complete.obs"))
ggplot(correlation.long1, aes(Value, mean_on)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor1, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#2C7BB6") +
facet_wrap(~Variable, ncol = 5, scales = "free") +
labs(title = "Mean Ridership as a Function of Positively Related Factors") + plotTheme()
ggplot(correlation.long2, aes(Value, mean_on)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor2, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#2C7BB6") +
facet_wrap(~Variable, ncol = 5, scales = "free") +
labs(title = "Mean Ridership as a Function of Negatively Related Factors") + plotTheme()
all_x1_ = all_x3 %>% st_set_geometry(NULL)
lmreg <- lm(mean_on ~ .,data = all_x1_)
summary(lmreg)
lm_model <-
all_x3 %>%
mutate(ridership.Predict = predict(lmreg, all_x1_),
ridership.Error = mean_on - ridership.Predict,
ridership.AbsError = abs(mean_on - ridership.Predict),
ridership.APE = (abs(mean_on - ridership.Predict) / mean_on))
plot_result = lm_model %>%
st_drop_geometry() %>%
left_join(stops %>% select(STOP_ID), by = "STOP_ID") %>%
st_as_sf()
ggplot()+
geom_sf(data = nhood, color = 'grey80',fill = 'grey80') +
geom_sf(data = plot_result, aes(color = ridership.APE), size = 1.5) +
labs(title = "Ridership Prediction Absolute Percentage Error") +
mapTheme()
# install (if needed) and load a package by name
load.fun <- function(x) {
  x <- as.character(x)
  if (isTRUE(x %in% .packages(all.available = TRUE))) {
    require(x, character.only = TRUE)
    print(paste0(x, " : already installed; requiring"))
  } else {
    print(paste0(x, " : not installed; installing"))
    install.packages(x)
    print(paste0(x, " : installed and requiring"))
    require(x, character.only = TRUE)
  }
}
packages = c("scales", "digest","colorspace","prodlim","yardstick","dplyr", "bayesplot", "lme4", "rstan", "shinystan", "RcppEigen",
"tidyverse", "tidyr", "AmesHousing", "broom", "caret", "dials", "doParallel", "e1071", "earth",
"ggrepel", "glmnet", "ipred", "klaR", "kknn", "pROC", "rpart", "randomForest",
"sessioninfo", "tidymodels","ranger", "recipes", "workflows", "themis","xgboost",
"sf", "nngeo", "mapview","parsnip", "vctrs", "rsample", "tune")
for(i in seq_along(packages)){
pkg <- as.character(packages[i])
load.fun(pkg)
}
warnings()
library(bayesplot)
library(lme4)
library(rstan)
library(shinystan)
library(RcppEigen)
library(broom)
library(glmnet)
library(tune)
library(yardstick)
#Split the data into training and testing sets
data_split <- rsample::initial_split(data, strata = "mean_on", prop = 0.75)
bus_train <- rsample::training(data_split)
bus_test <- rsample::testing(data_split)
table(bus_train$label)
cv_splits_geo <- rsample::group_vfold_cv(bus_train, strata = "mean_on", group = "label")
print(cv_splits_geo)
#Create recipe
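group_vfold_cv() assigns every stop in a neighborhood to the same fold, so each resample measures generalization to areas the model never saw during training. A toy illustration of that grouping behavior, on hypothetical data:

```r
library(rsample)

# three made-up neighborhoods, four stops each
toy <- data.frame(mean_on = rnorm(12),
                  label = rep(c("A", "B", "C"), each = 4))
folds <- group_vfold_cv(toy, group = "label", v = 3)
# each assessment set holds out one entire neighborhood
folds
```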
model_rec <- recipe(mean_on ~ ., data = bus_train) %>% #the "." means using every variable we have in the training dataset
update_role(label, new_role = "label") %>% #give label a non-predictor role: the neighborhood variable stays available for validation but out of the model
step_other(label, threshold = 0.005) %>%
step_dummy(all_nominal(), -label) %>%
step_log(mean_on) %>%
step_zv(all_predictors()) %>%
step_center(all_predictors(), -mean_on) %>%
step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))
#Build the model
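Before tuning, the recipe can be prepped and baked to inspect the design matrix the workflows will actually see; a quick sanity check, assuming a recent recipes version (older releases use juice() instead of bake(new_data = NULL)):

```r
library(recipes)
library(dplyr)

# inspect the preprocessed training data: dummies, log(mean_on),
# centered/scaled predictors
baked <- model_rec %>%
  prep(training = bus_train) %>%
  bake(new_data = NULL)
glimpse(baked)
```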
lm_plan <-
parsnip::linear_reg() %>%
parsnip::set_engine("lm")
glmnet_plan <-
parsnip::linear_reg() %>%
parsnip::set_args(penalty = tune()) %>%
parsnip::set_args(mixture = tune()) %>%
parsnip::set_engine("glmnet")
rf_plan <- parsnip::rand_forest() %>%
parsnip::set_args(mtry = tune()) %>%
parsnip::set_args(min_n = tune()) %>%
parsnip::set_args(trees = 1000) %>%
parsnip::set_engine("ranger", importance = "impurity") %>%
parsnip::set_mode("regression")
XGB_plan <- parsnip::boost_tree() %>%
parsnip::set_args(mtry = tune()) %>%
parsnip::set_args(min_n = tune()) %>%
parsnip::set_args(trees = 100) %>%
parsnip::set_engine("xgboost") %>%
parsnip::set_mode("regression")
#
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25),
mixture = seq(0,1,0.25))
rf_grid <- expand.grid(mtry = c(2,5),
min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5),
min_n = c(1,5))
#Create workflow
lm_wf <-
workflows::workflow() %>%
add_recipe(model_rec) %>%
add_model(lm_plan)
glmnet_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(glmnet_plan)
rf_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(rf_plan)
xgb_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)
lm_tuned <- lm_wf %>%
fit_resamples(.,
resamples = cv_splits_geo,
control = control,
metrics = metric_set(rmse, rsq))
glmnet_tuned <- glmnet_wf %>%
tune_grid(.,
resamples = cv_splits_geo,
grid = glmnet_grid,
control = control,
metrics = metric_set(rmse, rsq))
rf_tuned <- rf_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = rf_grid,
control = control,
metrics = metric_set(rmse, rsq))
xgb_tuned <- xgb_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = xgb_grid,
control = control,
metrics = metric_set(rmse, rsq))
show_best(lm_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(glmnet_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(xgb_tuned, metric = "rmse", n = 15, maximize = FALSE)
lm_best_params <- select_best(lm_tuned, metric = "rmse", maximize = FALSE)
glmnet_best_params <- select_best(glmnet_tuned, metric = "rmse", maximize = FALSE)
rf_best_params <- select_best(rf_tuned, metric = "rmse", maximize = FALSE)
xgb_best_params <- select_best(xgb_tuned, metric = "rmse", maximize = FALSE)
#Final workflow
lm_best_wf <- finalize_workflow(lm_wf, lm_best_params)
glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf <- finalize_workflow(xgb_wf, xgb_best_params)
lm_val_fit_geo <- lm_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
glmnet_val_fit_geo <- glmnet_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
rf_val_fit_geo <- rf_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
xgb_val_fit_geo <- xgb_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned)
glmnet_best_OOF_preds <- collect_predictions(glmnet_tuned) %>%
filter(penalty == glmnet_best_params$penalty[1] & mixture == glmnet_best_params$mixture[1])
rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
filter(mtry == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])
xgb_best_OOF_preds <- collect_predictions(xgb_tuned) %>%
filter(mtry == xgb_best_params$mtry[1] & min_n == xgb_best_params$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo <- collect_predictions(lm_val_fit_geo)
glmnet_val_pred_geo <- collect_predictions(glmnet_val_fit_geo)
rf_val_pred_geo <- collect_predictions(rf_val_fit_geo)
xgb_val_pred_geo <- collect_predictions(xgb_val_fit_geo)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds <- rbind(data.frame(dplyr::select(lm_best_OOF_preds, .pred, mean_on), model = "lm"),
data.frame(dplyr::select(glmnet_best_OOF_preds, .pred, mean_on), model = "glmnet"),
data.frame(dplyr::select(rf_best_OOF_preds, .pred, mean_on), model = "RF"),
data.frame(dplyr::select(xgb_best_OOF_preds, .pred, mean_on), model = "xgb")) %>%
group_by(model) %>%
mutate(.pred = exp(.pred),
mean_on = exp(mean_on),
RMSE = yardstick::rmse_vec(mean_on, .pred),
MAE = yardstick::mae_vec(mean_on, .pred),
MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
ungroup()
# Aggregate predictions from Validation set
val_preds <- rbind(data.frame(lm_val_pred_geo, model = "lm"),
data.frame(glmnet_val_pred_geo, model = "glmnet"),
data.frame(rf_val_pred_geo, model = "rf"),
data.frame(xgb_val_pred_geo, model = "xgb")) %>%
left_join(., data %>%
rowid_to_column(var = ".row") %>%
dplyr::select(label, .row),
by = ".row") %>%
group_by(model) %>%
mutate(.pred = exp(.pred),
mean_on = exp(mean_on),
RMSE = yardstick::rmse_vec(mean_on, .pred),
MAE = yardstick::mae_vec(mean_on, .pred),
MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
ungroup()
###################################Modelling part ends here, below are the visualizations##############
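Note the exp() calls above: because step_log() transformed mean_on in the recipe, predictions come back on the log scale, and error metrics are only interpretable in riders per stop after back-transforming. A small illustration with made-up numbers:

```r
library(yardstick)

obs  <- c(10, 20, 40)                      # hypothetical observed boardings
pred <- c(12, 18, 35)                      # hypothetical predictions
mae_log  <- mae_vec(log(obs), log(pred))   # log units: hard to interpret
mae_orig <- mae_vec(obs, pred)             # original units: 3 riders per stop
```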
#MAPE chart
ggplot(data = val_preds %>%
dplyr::select(model, MAPE) %>%
distinct() ,
aes(x = model, y = MAPE, group = 1)) +
geom_path(color = "blue") +
geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
labs(title= "MAPE of each model on the testing set") +
theme_bw()
#MAE chart
ggplot(data = val_preds %>%
dplyr::select(model, MAE) %>%
distinct() ,
aes(x = model, y = MAE, group = 1)) +
geom_path(color = "blue") +
geom_label(aes(label = paste0(round(MAE,1)))) +
labs(title= "MAE of each model on the testing set") +
theme_bw()
#Predicted vs Observed
ggplot(val_preds, aes(x =.pred, y = mean_on, group = model)) +
geom_point() +
geom_abline(linetype = "dashed", color = "red") +
geom_smooth(method = "lm", color = "blue") +
coord_equal() +
facet_wrap(~model, nrow = 2) +
labs(title="Predicted vs Observed on the testing set", subtitle= "blue line is the model fit; red dashed line is perfect prediction") +
theme_bw()
#Neighborhood validation
val_MAPE_by_hood <- val_preds %>%
group_by(label, model) %>%
summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
MAE = yardstick::mae_vec(mean_on, .pred),
MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
ungroup()
#Barchart of the MAE in each neighborhood
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")
val_MAPE_by_hood %>%
dplyr::select(label, model, MAE) %>%
gather(Variable, MAE, -model, -label) %>%
ggplot(aes(label, MAE)) +
geom_bar(aes(fill = model), position = "dodge", stat="identity") +
scale_fill_manual(values = palette4) +
facet_wrap(~label,scales="free", ncol=6)+
labs(title = "Mean Absolute Errors by model specification and neighborhood") +
plotTheme()
#Map of MAE in each neighborhood
#Add geometry to the MAE
MAE.nhood <- merge(nhood, val_MAPE_by_hood, by.x="label", by.y="label", all.y=TRUE)
#Produce the map
#Map: MAPE of lm
MAE.nhood%>%
filter(model=="lm") %>%
ggplot() +
# geom_sf(data = nhoods, fill = "grey40") +
geom_sf(aes(fill = q5(MAPE))) +
scale_fill_brewer(palette = "Blues",
aesthetics = "fill",
labels=qBr(MAE.nhood,"MAPE"),
name="Quintile\nBreaks, (%)") +
labs(title="MAPE of lm in Neighborhoods") +
mapTheme()
#Map: MAPE of glmnet
MAE.nhood%>%
filter(model=="glmnet") %>%
ggplot() +
# geom_sf(data = nhoods, fill = "grey40") +
geom_sf(aes(fill = q5(MAPE))) +
scale_fill_brewer(palette = "Blues",
aesthetics = "fill",
labels=qBr(MAE.nhood,"MAPE"),
name="Quintile\nBreaks, (%)") +
labs(title="MAPE of glmnet in Neighborhoods") +
mapTheme()
#MAPE of rf
MAE.nhood%>%
filter(model=="rf") %>%
ggplot() +
# geom_sf(data = nhoods, fill = "grey40") +
geom_sf(aes(fill = q5(MAPE))) +
scale_fill_brewer(palette = "Blues",
aesthetics = "fill",
labels=qBr(MAE.nhood,"MAPE"),
name="Quintile\nBreaks, (%)") +
labs(title="MAPE of rf in Neighborhoods") +
mapTheme()
#MAPE of xgb
MAE.nhood%>%
filter(model=="xgb") %>%
ggplot() +
# geom_sf(data = nhoods, fill = "grey40") +
geom_sf(aes(fill = q5(MAPE))) +
scale_fill_brewer(palette = "Blues",
aesthetics = "fill",
labels=qBr(MAE.nhood,"MAPE"),
name="Quintile\nBreaks, (%)") +
labs(title="MAPE of xgb in Neighborhoods") +
mapTheme()
########################Test on School Districts##########
#Create recipe
model_rec <- recipe(mean_on ~ ., data = bus_train) %>% #the "." means using every variable we have in the training dataset
update_role(TRUSTEE, new_role = "TRUSTEE") %>% #give TRUSTEE a non-predictor role: the school district variable stays available for validation but out of the model
step_other(TRUSTEE, threshold = 0.005) %>%
step_dummy(all_nominal(), -TRUSTEE) %>%
step_log(mean_on) %>%
step_zv(all_predictors()) %>%
step_center(all_predictors(), -mean_on) %>%
step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))
model_rec
#Build the model
lm_plan <-
parsnip::linear_reg() %>%
parsnip::set_engine("lm")
glmnet_plan <-
parsnip::linear_reg() %>%
parsnip::set_args(penalty = tune()) %>%
parsnip::set_args(mixture = tune()) %>%
parsnip::set_engine("glmnet")
rf_plan <- parsnip::rand_forest() %>%
parsnip::set_args(mtry = tune()) %>%
parsnip::set_args(min_n = tune()) %>%
parsnip::set_args(trees = 1000) %>%
parsnip::set_engine("ranger", importance = "impurity") %>%
parsnip::set_mode("regression")
XGB_plan <- parsnip::boost_tree() %>%
parsnip::set_args(mtry = tune()) %>%
parsnip::set_args(min_n = tune()) %>%
parsnip::set_args(trees = 100) %>%
parsnip::set_engine("xgboost") %>%
parsnip::set_mode("regression")
#
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25),
mixture = seq(0,1,0.25))
rf_grid <- expand.grid(mtry = c(2,5),
min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5),
min_n = c(1,5))
#Create workflow
lm_wf <-
workflows::workflow() %>%
add_recipe(model_rec) %>%
add_model(lm_plan)
glmnet_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(glmnet_plan)
rf_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(rf_plan)
xgb_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)
lm_tuned <- lm_wf %>%
fit_resamples(.,
resamples = cv_splits_geo,
control = control,
metrics = metric_set(rmse, rsq))
glmnet_tuned <- glmnet_wf %>%
tune_grid(.,
resamples = cv_splits_geo,
grid = glmnet_grid,
control = control,
metrics = metric_set(rmse, rsq))
rf_tuned <- rf_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = rf_grid,
control = control,
metrics = metric_set(rmse, rsq))
xgb_tuned <- xgb_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = xgb_grid,
control = control,
metrics = metric_set(rmse, rsq))
show_best(lm_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(glmnet_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(xgb_tuned, metric = "rmse", n = 15, maximize = FALSE)
lm_best_params <- select_best(lm_tuned, metric = "rmse", maximize = FALSE)
glmnet_best_params <- select_best(glmnet_tuned, metric = "rmse", maximize = FALSE)
rf_best_params <- select_best(rf_tuned, metric = "rmse", maximize = FALSE)
xgb_best_params <- select_best(xgb_tuned, metric = "rmse", maximize = FALSE)
#Final workflow
lm_best_wf <- finalize_workflow(lm_wf, lm_best_params)
glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf <- finalize_workflow(xgb_wf, xgb_best_params)
lm_val_fit_geo <- lm_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
glmnet_val_fit_geo <- glmnet_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
rf_val_fit_geo <- rf_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
xgb_val_fit_geo <- xgb_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned)
glmnet_best_OOF_preds <- collect_predictions(glmnet_tuned) %>%
filter(penalty == glmnet_best_params$penalty[1] & mixture == glmnet_best_params$mixture[1])
rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
filter(mtry == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])
xgb_best_OOF_preds <- collect_predictions(xgb_tuned) %>%
filter(mtry == xgb_best_params$mtry[1] & min_n == xgb_best_params$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo <- collect_predictions(lm_val_fit_geo)
glmnet_val_pred_geo <- collect_predictions(glmnet_val_fit_geo)
rf_val_pred_geo <- collect_predictions(rf_val_fit_geo)
xgb_val_pred_geo <- collect_predictions(xgb_val_fit_geo)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds <- rbind(data.frame(dplyr::select(lm_best_OOF_preds, .pred, mean_on), model = "lm"),
data.frame(dplyr::select(glmnet_best_OOF_preds, .pred, mean_on), model = "glmnet"),
data.frame(dplyr::select(rf_best_OOF_preds, .pred, mean_on), model = "RF"),
data.frame(dplyr::select(xgb_best_OOF_preds, .pred, mean_on), model = "xgb")) %>%
group_by(model) %>%
mutate(.pred = exp(.pred),
mean_on = exp(mean_on),
RMSE = yardstick::rmse_vec(mean_on, .pred),
MAE = yardstick::mae_vec(mean_on, .pred),
MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
ungroup()
# Aggregate predictions from Validation set
val_preds <- rbind(data.frame(lm_val_pred_geo, model = "lm"),
data.frame(glmnet_val_pred_geo, model = "glmnet"),
data.frame(rf_val_pred_geo, model = "rf"),
data.frame(xgb_val_pred_geo, model = "xgb")) %>%
left_join(., data %>%
rowid_to_column(var = ".row") %>%
dplyr::select(TRUSTEE, .row),
by = ".row") %>%
group_by(model) %>%
mutate(.pred = exp(.pred),
mean_on = exp(mean_on),
RMSE = yardstick::rmse_vec(mean_on, .pred),
MAE = yardstick::mae_vec(mean_on, .pred),
MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
ungroup()
###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds %>%
dplyr::select(model, MAPE) %>%
distinct() ,
aes(x = model, y = MAPE, group = 1)) +
geom_path(color = "blue") +
geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
labs(title= "MAPE of each model on the testing set") +
theme_bw()
#MAE chart
ggplot(data = val_preds %>%
dplyr::select(model, MAE) %>%
distinct() ,
aes(x = model, y = MAE, group = 1)) +
geom_path(color = "blue") +
geom_label(aes(label = paste0(round(MAE,1)))) +
labs(title= "MAE of each model on the testing set") +
theme_bw()
#Predicted vs Observed
ggplot(val_preds, aes(x =.pred, y = mean_on, group = model)) +
geom_point() +
geom_abline(linetype = "dashed", color = "red") +
geom_smooth(method = "lm", color = "blue") +
coord_equal() +
facet_wrap(~model, nrow = 2) +
labs(title="Predicted vs Observed on the testing set", subtitle= "blue line is the model fit; red dashed line is perfect prediction") +
theme_bw()
#School district validation
val_MAPE_by_district <- val_preds %>%
group_by(TRUSTEE, model) %>%
summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
MAE = yardstick::mae_vec(mean_on, .pred),
MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
ungroup()
#Barchart of the MAE in each school district
plotTheme <- function(base_size = 20) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 50,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=30),
axis.title = element_text(size=30),
axis.text = element_text(size=20),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic", size= 50),
legend.text = element_text(colour = "black", face = "italic",size = 50),
strip.text.x = element_text(size = 30)
)
}
val_MAPE_by_district %>%
dplyr::select(TRUSTEE, model, MAE) %>%
gather(Variable, MAE, -model, -TRUSTEE) %>%
ggplot(aes(TRUSTEE, MAE)) +
geom_bar(aes(fill = model), position = "dodge", stat="identity") +
scale_fill_manual(values = palette4) +
facet_wrap(~TRUSTEE,scales="free", ncol=8)+
ylim(0,200)+
labs(title = "Mean Absolute Errors by model specification and school districts") +
plotTheme()
#Map of MAE in each school district
#Load school district data
TrusteeDist <-
st_read("C:/Upenn/Practicum/Data/Trustee_Boundaries/Trustee.shp")%>%
st_transform(2278)
#Add geometry to the MAE
MAE.district <- merge(TrusteeDist, val_MAPE_by_district, by.x="TRUSTEE", by.y="TRUSTEE", all.y=TRUE)
#Produce the map
#Map: MAPE of lm
MAE.district%>%
filter(model=="lm") %>%
ggplot() +
# geom_sf(data = nhoods, fill = "grey40") +
geom_sf(aes(fill = q5(MAPE))) +
scale_fill_brewer(palette = "Blues",
aesthetics = "fill",
labels=qBr(MAE.district,"MAPE"),
name="Quintile\nBreaks, (%)") +
labs(title="MAPE of lm in School Districts") +
mapTheme()
#Map: MAPE of glmnet
MAE.district%>%
filter(model=="glmnet") %>%
ggplot() +
# geom_sf(data = nhoods, fill = "grey40") +
geom_sf(aes(fill = q5(MAPE))) +
scale_fill_brewer(palette = "Blues",
aesthetics = "fill",
labels=qBr(MAE.district,"MAPE"),
name="Quintile\nBreaks, (%)") +
labs(title="MAPE of glmnet in School Districts") +
mapTheme()
#MAPE of rf
MAE.district%>%
filter(model=="rf") %>%
ggplot() +
# geom_sf(data = nhoods, fill = "grey40") +
geom_sf(aes(fill = q5(MAPE))) +
scale_fill_brewer(palette = "Blues",
aesthetics = "fill",
labels=qBr(MAE.district,"MAPE"),
name="Quintile\nBreaks, (%)") +
labs(title="MAPE of rf in School Districts") +
mapTheme()
#MAPE of xgb
MAE.district%>%
filter(model=="xgb") %>%
ggplot() +
# geom_sf(data = nhoods, fill = "grey40") +
geom_sf(aes(fill = q5(MAPE))) +
scale_fill_brewer(palette = "Blues",
aesthetics = "fill",
labels=qBr(MAE.district,"MAPE"),
name="Quintile\nBreaks, (%)") +
labs(title="MAPE of xgb in School Districts") +
mapTheme()